Trainable, Scalable Summarization Using Robust NLP and Machine Learning
Authors
Abstract
We describe a trainable and scalable summarization system which utilizes features derived from information retrieval, information extraction, and NLP techniques and on-line resources. The system combines these features using a trainable feature combiner learned from summary examples through a machine learning algorithm. We demonstrate system scalability by reporting results on the best combination of summarization features for different document sources. We also present preliminary results from a task-based evaluation on summarization output usability.

1 Introduction

Frequency-based (Edmundson, 1969; Kupiec, Pedersen, and Chen, 1995; Brandow, Mitze, and Rau, 1995), knowledge-based (Reimer and Hahn, 1988; McKeown and Radev, 1995), and discourse-based (Johnson et al., 1993; Miike et al., 1994; Jones, 1995) approaches to automated summarization correspond to a continuum of increasing understanding of the text and increasing complexity in text processing. Given the goal of machine-generated summaries, these approaches attempt to answer three central questions:

• How does the system count words to calculate worthiness for summarization?
• How does the system incorporate the knowledge of the domain represented in the text?
• How does the system create a coherent and cohesive summary?

Our work leverages off of research in these three approaches and attempts to remedy some of the difficulties encountered in each by applying a combination of information retrieval, information extraction, and NLP techniques and on-line resources with machine learning to generate summaries. Our DimSum system follows a common paradigm of sentence extraction, but automates acquiring candidate knowledge and learns what knowledge is necessary to summarize. We present how we automatically acquire candidate features in Section 2.

*We would like to thank Jamie Callan for his help with the INQUERY experiments.
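The excerpt does not spell out how the trainable feature combiner works. One classical instantiation of trainable sentence extraction is the naive-Bayes classifier of Kupiec, Pedersen, and Chen (1995), cited above: estimate, from summary examples, how likely each feature value is in summary-worthy versus other sentences, then rank sentences by posterior log-odds. The sketch below is our own illustration of that scheme, not DimSum's actual combiner; the function names, boolean feature encoding, and smoothing constant are all assumptions.

```python
import math
from collections import Counter

def train(feature_vectors, labels, alpha=1.0):
    """Learn a naive-Bayes combiner from labeled sentences.

    feature_vectors: one tuple of boolean feature values per sentence.
    labels: 1 if the sentence appeared in a human summary, else 0.
    Returns a log-likelihood function loglik(fv, c) with add-alpha smoothing.
    """
    n = len(labels)
    n_pos = sum(labels)
    prior = {1: (n_pos + alpha) / (n + 2 * alpha),
             0: (n - n_pos + alpha) / (n + 2 * alpha)}
    k = len(feature_vectors[0])
    # Per-class, per-feature counts of observed values.
    counts = {c: [Counter() for _ in range(k)] for c in (0, 1)}
    for fv, y in zip(feature_vectors, labels):
        for i, v in enumerate(fv):
            counts[y][i][v] += 1

    def loglik(fv, c):
        total = sum(1 for y in labels if y == c)
        s = math.log(prior[c])
        for i, v in enumerate(fv):
            s += math.log((counts[c][i][v] + alpha) / (total + 2 * alpha))
        return s

    return loglik

def score(loglik, fv):
    """Posterior log-odds that a sentence belongs in the summary."""
    return loglik(fv, 1) - loglik(fv, 0)
```

In use, each sentence is encoded as feature values (e.g., contains a signature word, appears early in the document), the combiner is trained on documents paired with example summaries, and the top-scoring sentences are extracted.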
Section 3 describes our training methodology for combining features to generate summaries, and discusses evaluation results of both batch and machine learning methods. Section 4 reports our task-based evaluation.

2 Extracting Features

In this section, we describe how the system counts linguistically-motivated, automatically-derived words and multi-words in calculating worthiness for summarization. We show how the system uses an external corpus to incorporate domain knowledge in contrast to text-only statistics. Finally, we explain how we attempt to increase the cohesiveness of our summaries by using name aliasing, WordNet synonyms, and morphological variants.

2.1 Defining Single and Multiword Terms

Frequency-based summarization systems typically use a single word string as the unit for counting frequency. Though robust, such a method ignores the semantic content of words and their potential membership in multi-word phrases and may introduce noise in frequency counting by treating the same strings uniformly regardless of context. Our approach, similar to (Tzoukermann, Klavans, and Jacquemin, 1997), is to apply NLP tools to extract multi-word phrases automatically with high accuracy and use them as the basic unit in the summarization process, including frequency calculation. Our system uses both text statistics (term frequency, or tf) and corpus statistics (inverse document frequency, or idf) (Salton and McGill, 1983) to derive signature words as one of the summarization features. If single words were the sole basis of counting for our summarization application, noise would be
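The tf·idf weighting cited above (Salton and McGill, 1983) scores a term by its frequency in the document, discounted by how many background-corpus documents contain it, so that corpus-wide common words are suppressed. A minimal sketch follows; the function name, smoothing in the idf denominator, and toy inputs are our illustration, not the paper's implementation:

```python
import math
from collections import Counter

def signature_words(doc_tokens, corpus_doc_freq, n_docs, top_k=5):
    """Rank a document's terms by tf * idf and return the top_k.

    doc_tokens: tokenized document (term frequency is computed from it).
    corpus_doc_freq: term -> number of background documents containing it.
    n_docs: size of the background corpus.
    """
    tf = Counter(doc_tokens)

    def tfidf(term):
        df = corpus_doc_freq.get(term, 0)
        # Add 1 to the document frequency to avoid division by zero.
        return tf[term] * math.log(n_docs / (1 + df))

    return sorted(tf, key=tfidf, reverse=True)[:top_k]
```

A term that is frequent in the document but rare in the corpus (here, "summary") outranks a term that is common everywhere (here, "system"), which is the behavior the signature-word feature relies on.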
Publication date: 1998